11,051 research outputs found

    Transformation of alignment files improves performance of variant callers for long-read RNA sequencing data

    Long-read RNA sequencing (lrRNA-seq) produces detailed information about full-length transcripts, including novel and sample-specific isoforms. Furthermore, there is an opportunity to call variants directly from lrRNA-seq data. However, most state-of-the-art variant callers have been developed for genomic DNA. Here, there are two objectives: first, we perform a mini-benchmark on GATK, DeepVariant, Clair3, and NanoCaller, primarily on PacBio Iso-Seq data but also on Nanopore and Illumina RNA-seq data; second, we propose a pipeline to process spliced-alignment files, making them suitable for variant calling with DNA-based callers. With such manipulations, high calling performance can be achieved using DeepVariant on Iso-Seq data.
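The abstract above does not say which manipulations the pipeline applies to spliced alignments. As a rough illustration only, the sketch below assumes one plausible step: cutting an alignment at each `N` (intron-skip) CIGAR operation so that every resulting segment looks like an unspliced, DNA-style alignment. The function name `split_at_introns` is hypothetical, and the sketch tracks only positions and CIGAR strings; a real tool would also have to trim the read sequence, base qualities, and flags.

```python
import re
from typing import List, Tuple

# One CIGAR operation: a length followed by a SAM operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def split_at_introns(pos: int, cigar: str) -> List[Tuple[int, str]]:
    """Split a spliced alignment at every N operation (hypothetical helper).

    pos is the 1-based leftmost reference position of the alignment.
    Returns one (start_position, cigar) pair per exon-level segment.
    """
    segments: List[Tuple[int, str]] = []
    current: List[str] = []   # CIGAR operations of the segment being built
    seg_start = pos           # reference start of the current segment
    ref_pos = pos             # running reference coordinate

    for length, op in CIGAR_OP.findall(cigar):
        n = int(length)
        if op == "N":
            # Close the segment before the intron and jump over the gap.
            if current:
                segments.append((seg_start, "".join(current)))
            current = []
            ref_pos += n
            seg_start = ref_pos
        else:
            current.append(f"{n}{op}")
            # Only M, D, = and X advance along the reference.
            if op in "MD=X":
                ref_pos += n

    if current:
        segments.append((seg_start, "".join(current)))
    return segments


if __name__ == "__main__":
    print(split_at_introns(100, "50M1000N30M2I20M"))
```

Keeping the split purely at the coordinate level makes the idea easy to check by hand, at the cost of ignoring everything a production pipeline would need to rewrite in the alignment record.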

    abSNP: RNA-Seq SNP Calling in Repetitive Regions via Abundance Estimation

    Variant calling, in particular calling SNPs (single nucleotide polymorphisms), is a fundamental task in genomics. While existing packages offer excellent performance on calling SNPs which have uniquely mapped reads, they suffer in loci where the reads are multiply mapped, and are unable to make any reliable calls. Variants in multiply mapped loci can arise, for example, in long segmental duplications, and can play an important role in evolution and disease. In this paper, we develop a new SNP caller named abSNP, which offers three innovations. (a) abSNP calls SNPs from RNA-Seq data. Since RNA-Seq data is primarily sampled from gene regions, this method is inexpensive. (b) abSNP is able to successfully make calls on repetitive gene regions by carefully exploiting the quality scores of multiply mapped reads. (c) abSNP exploits a specific feature of RNA-Seq data, namely the varying abundance of different genes, in order to identify which repetitive copy a particular read is sampled from. We demonstrate that the proposed method offers significant performance gains on repetitive regions in simulated data. In particular, the algorithm is able to achieve near-perfect sensitivity on high-coverage SNPs, even when multiply mapped

    Development of a reproducible transcriptomics variant calling workflow and its application to colorectal cancer : a thesis presented in partial fulfilment of the requirements for the degree of Master of Science in Genetics at Massey University, Albany, New Zealand

    Appendix 1 is not publicly available. Colorectal cancer (CRC) is one of the most common cancers worldwide. It has some of the highest rates in New Zealand, exacerbated by shortcomings in available diagnostic tools and survival discrepancies between Maori and non-Maori demographics. In this project, a bioinformatics workflow was developed to make “high confidence” single nucleotide polymorphism (SNP) variant calls from transcriptomics/RNA-seq data. While calling variants from whole genome and exome sequencing is common, standard workflows for calling variants from RNA-seq data do not exist. Here, we aimed to use two common RNA-seq pre-processing methods which we then complemented with an ensemble of variant calling tools, improving confidence in any variants called. We then applied this pipeline to two independent CRC datasets with the hope that those variant calls could improve our understanding of the disease, one of the most significant aggregators of cancer-related mortality. Variant calls were made including those with clinical implications, such as the same KRAS gene variant being called in both geographically distinct populations. Multiple “novel” variants, or those lacking clinically significant annotations, were also obtained for known oncogenic targets (e.g. MAPK1 and AKT1). RNA-seq variant calling remains problematic. The results of this study have provided us with some direction and considerations for future work, such as including normal samples to better distinguish between germline and somatic variants and to permit the use of more somatic variant calling tools. Future work is also needed to understand how or if those novel variant calls could improve our understanding of CR

    Development of two new approaches for NGS data analysis of DNA and RNA molecules and their application in clinical and research fields

    This study focuses on two main areas of NGS data analysis: RNA-seq (with a specific interest in meta-transcriptomics) and detection of DNA somatic mutations. We developed a simple and efficient pipeline for the analysis of NGS data derived from gene panels to identify DNA somatic point mutations. In particular, we optimized a somatic variant calling procedure that was tested on simulated datasets and on real data. The performance of our system has been compared with currently available variant calling tools reviewed in the literature. For RNA-seq analysis, we tested and optimized STAble, an algorithm originally developed in our laboratory for the de novo reconstruction of transcripts from non-reference-based RNA-seq data. At the beginning of this study, the first module of STAble had already been written. This first module reconstructs a list of transcripts starting from RNA-seq data. The particular aim of this study was to add a new module to STAble, developed in collaboration with Cambridge University and based on flux-balance analysis, in order to link the metatranscriptomic analysis to a metabolic approach. This goal has been achieved in order to study the metabolic fluxes of microbiota starting from metatranscriptomic data

    Indexing arbitrary-length k-mers in sequencing reads

    We propose a lightweight data structure for indexing and querying collections of NGS reads data in main memory. The data structure supports the interface proposed in the pioneering work by Philippe et al. for counting and locating k-mers in sequencing reads. Our solution, PgSA (pseudogenome suffix array), based on finding overlapping reads, is competitive with the existing algorithms in space use, query times, or both. The main applications of our index include variant calling, error correction and analysis of reads from RNA-seq experiments
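To make the count/locate interface mentioned above concrete, here is a minimal sketch that answers both queries for a k-mer of any length using a plain sorted list of read suffixes. This is not PgSA: it builds no pseudogenome, exploits no read overlaps, and uses far more memory than a compact index would. The class name `NaiveReadIndex` and its method names are assumptions made for illustration.

```python
from bisect import bisect_left
from typing import Iterable, List, Tuple


class NaiveReadIndex:
    """Toy index over a read collection (hypothetical, for illustration)."""

    def __init__(self, reads: Iterable[str]) -> None:
        self.reads = list(reads)
        # Every suffix of every read, stored as (suffix, read_id, offset)
        # and sorted lexicographically by suffix.
        self.suffixes: List[Tuple[str, int, int]] = sorted(
            (read[i:], rid, i)
            for rid, read in enumerate(self.reads)
            for i in range(len(read))
        )

    def locate(self, kmer: str) -> List[Tuple[int, int]]:
        """Return sorted (read_id, offset) pairs where kmer occurs."""
        # A 1-tuple sorts before any 3-tuple sharing the same first element,
        # so this finds the first suffix that is >= kmer.
        i = bisect_left(self.suffixes, (kmer,))
        hits: List[Tuple[int, int]] = []
        # Suffixes that start with kmer form one contiguous run.
        while i < len(self.suffixes) and self.suffixes[i][0].startswith(kmer):
            hits.append(self.suffixes[i][1:])
            i += 1
        return sorted(hits)

    def count(self, kmer: str) -> int:
        """Return the number of occurrences of kmer across all reads."""
        return len(self.locate(kmer))


if __name__ == "__main__":
    index = NaiveReadIndex(["ACGTAC", "TACG"])
    print(index.locate("AC"), index.count("TAC"))
```

Because the query length is not fixed when the index is built, the same structure serves 2-mers and 20-mers alike, which is the property the title refers to as arbitrary-length k-mers.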

    The impact of RNA sequence library construction protocols on transcriptomic profiling of leukemia

    Background: RNA sequencing (RNA-seq) has become an indispensable tool to identify disease-associated transcriptional profiles and determine the molecular underpinnings of diseases. However, the broad adoption of the methodology in the clinic is still hampered by inconsistent results from different RNA-seq protocols and requires further evaluation of its analytical reliability using patient samples. Here, we applied two commonly used RNA-seq library preparation protocols to samples from acute leukemia patients to understand how poly-A-tailed mRNA selection (PA) and ribo-depletion (RD) based RNA-seq library preparation protocols affect gene fusion detection, variant calling, and gene expression profiling. Results: Overall, the protocols produced similar results with consistent outcomes. Nevertheless, the PA protocol was more efficient in quantifying expression of leukemia marker genes and showed better performance in the expression-based classification of leukemia. Independent qRT-PCR experiments verified that the PA protocol better represented total RNA compared to the RD protocol. In contrast, the RD protocol detected a higher number of non-coding RNA features and had better alignment efficiency. The RD protocol also recovered more known fusion-gene events, although variability was seen in fusion gene predictions. Conclusion: The overall findings provide a framework for the use of RNA-seq in a precision medicine setting with a limited number of samples and suggest that selection of the library preparation protocol should be based on the objectives of the analysis. Peer reviewe

    A computational method for estimating the PCR duplication rate in DNA and RNA-seq experiments.

    Background: PCR amplification is an important step in the preparation of DNA sequencing libraries prior to high-throughput sequencing. PCR amplification introduces redundant reads in the sequence data and estimating the PCR duplication rate is important to assess the frequency of such reads. Existing computational methods do not distinguish PCR duplicates from "natural" read duplicates that represent independent DNA fragments and therefore, over-estimate the PCR duplication rate for DNA-seq and RNA-seq experiments. Results: In this paper, we present a computational method to estimate the average PCR duplication rate of high-throughput sequence datasets that accounts for natural read duplicates by leveraging heterozygous variants in an individual genome. Analysis of simulated data and exome sequence data from the 1000 Genomes project demonstrated that our method can accurately estimate the PCR duplication rate on paired-end as well as single-end read datasets which contain a high proportion of natural read duplicates. Further, analysis of exome datasets prepared using the Nextera library preparation method indicated that 45-50% of read duplicates correspond to natural read duplicates likely due to fragmentation bias. Finally, analysis of RNA-seq datasets from individuals in the 1000 Genomes project demonstrated that 70-95% of read duplicates observed in such datasets correspond to natural duplicates sampled from genes with high expression and identified outlier samples with a 2-fold greater PCR duplication rate than other samples. Conclusions: The method described here is a useful tool for estimating the PCR duplication rate of high-throughput sequence datasets and for assessing the fraction of read duplicates that correspond to natural read duplicates. An implementation of the method is available at https://github.com/vibansal/PCRduplicates
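The abstract above does not spell out how heterozygous variants separate the two kinds of duplicates. One simplified reading, offered here as an assumption rather than as the authors' estimator, is this: two reads that are PCR copies of the same fragment must carry the same allele at a heterozygous site, while two independent fragments that happen to share coordinates carry different alleles about half the time. Under that assumption, and ignoring sequencing errors and allele-specific expression, the natural fraction among duplicate pairs is roughly twice the fraction of pairs showing different alleles. The function name and its inputs are hypothetical.

```python
def estimate_pcr_fraction(n_same_allele: int, n_diff_allele: int) -> float:
    """Estimate the share of duplicate read pairs that are PCR duplicates.

    Inputs are counts of duplicate pairs overlapping a heterozygous site,
    split by whether the two reads show the same or different alleles.
    Assumes both alleles are sampled with probability 0.5 (an assumption
    that allele-specific expression in RNA-seq can violate).
    """
    total = n_same_allele + n_diff_allele
    if total == 0:
        raise ValueError("no duplicate pairs overlap a heterozygous site")
    # Natural duplicate pairs disagree half the time, so double the
    # observed disagreement rate; cap at 1.0 to absorb sampling noise.
    natural_fraction = min(1.0, 2.0 * n_diff_allele / total)
    return 1.0 - natural_fraction


if __name__ == "__main__":
    # 10 of 100 duplicate pairs disagree, so about 20% are natural duplicates.
    print(estimate_pcr_fraction(90, 10))
```

The sketch handles only pairs of duplicates; clusters of three or more reads with the same coordinates would need a fuller model than this two-outcome count.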

    VarDict: a novel and versatile variant caller for next-generation sequencing in cancer research

    Accurate variant calling in next generation sequencing (NGS) is critical to understand cancer genomes better. Here we present VarDict, a novel and versatile variant caller for both DNA- and RNA-sequencing data. VarDict simultaneously calls SNV, MNV, InDels, complex and structural variants, expanding the detected genetic driver landscape of tumors. It performs local realignments on the fly for more accurate allele frequency estimation. VarDict performance scales linearly to sequencing depth, enabling ultra-deep sequencing used to explore tumor evolution or detect tumor DNA circulating in blood. In addition, VarDict performs amplicon aware variant calling for polymerase chain reaction (PCR)-based targeted sequencing often used in diagnostic settings, and is able to detect PCR artifacts. Finally, VarDict also detects differences in somatic and loss of heterozygosity variants between paired samples. VarDict reprocessing of The Cancer Genome Atlas (TCGA) Lung Adenocarcinoma dataset called known driver mutations in KRAS, EGFR, BRAF, PIK3CA and MET in 16% more patients than previously published variant calls. We believe VarDict will greatly facilitate application of NGS in clinical cancer research

    Selection for improved energy use efficiency and drought tolerance in canola results in distinct transcriptome and epigenome changes

    To increase both the yield potential and stability of crops, integrated breeding strategies are used that have mostly a direct genetic basis, but the utility of epigenetics to improve complex traits is unclear. A better understanding of the status of the epigenome and its contribution to agronomic performance would help in developing approaches to incorporate the epigenetic component of complex traits into breeding programs. Starting from isogenic canola (Brassica napus) lines, epilines were generated by selecting, repeatedly for three generations, for increased energy use efficiency and drought tolerance. These epilines had an enhanced energy use efficiency, drought tolerance, and nitrogen use efficiency. Transcriptome analysis of the epilines and a line selected for its energy use efficiency solely revealed common differentially expressed genes related to the onset of stress tolerance-regulating signaling events. Genes related to responses to salt, osmotic, abscisic acid, and drought treatments were specifically differentially expressed in the drought-tolerant epilines. The status of the epigenome, scored as differential trimethylation of lysine-4 of histone 3, further supported the phenotype by targeting drought-responsive genes and facilitating the transcription of the differentially expressed genes. From these results, we conclude that the canola epigenome can be shaped by selection to increase energy use efficiency and stress tolerance. Hence, these findings warrant the further development of strategies to incorporate epigenetics into breeding

    Bioinformatics tools for analysing viral genomic data

    The field of viral genomics and bioinformatics is experiencing a strong resurgence due to high-throughput sequencing (HTS) technology, which enables the rapid and cost-effective sequencing and subsequent assembly of large numbers of viral genomes. In addition, the unprecedented power of HTS technologies has enabled the analysis of intra-host viral diversity and quasispecies dynamics in relation to important biological questions on viral transmission, vaccine resistance and host jumping. HTS also enables the rapid identification of both known and potentially new viruses from field and clinical samples, thus adding new tools to the fields of viral discovery and metagenomics. Bioinformatics has been central to the rise of HTS applications because new algorithms and software tools are continually needed to process and analyse the large, complex datasets generated in this rapidly evolving area. In this paper, the authors give a brief overview of the main bioinformatics tools available for viral genomic research, with a particular emphasis on HTS technologies and their main applications. They summarise the major steps in various HTS analyses, starting with quality control of raw reads and encompassing activities ranging from consensus and de novo genome assembly to variant calling and metagenomics, as well as RNA sequencing